WineQualityReds by Farnaz_Motamediyan

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## [1] 1599
## [1] 13
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The RedWineData dataset has 1599 entries of 13 variables: the row index “X” followed by “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol” and “quality”. Quality is an ordinal variable with possible scores from 0 (worst) to 10 (best).

Univariate Plots Section

## 
##  Pearson's product-moment correlation
## 
## data:  RedWineData$quality and RedWineData$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  quality and fixed.acidity
## t = -0.62012, df = 215, p-value = 0.5358
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17445680  0.09144487
## sample estimates:
##         cor 
## -0.04225416
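
Output like the two Pearson tests above comes from cor.test(). The first test uses all 1599 wines (df = 1597). The second has df = 215, that is 217 observations, which matches the number of wines rated good further down. A minimal sketch of both calls follows, assuming the second test was run on the quality >= 7 subset. The toy data frame is invented for illustration and is not taken from RedWineData.

```r
# Toy stand-in for RedWineData (invented values, for illustration only).
set.seed(1)
toy <- data.frame(
  quality       = rep(3:8, length.out = 60),
  fixed.acidity = rnorm(60, mean = 8.3, sd = 1.7)
)

# Correlation test on every row.
all_wines <- cor.test(toy$quality, toy$fixed.acidity)

# Same test restricted to wines scored 7 or higher (assumed subset).
good_wines <- with(subset(toy, quality >= 7),
                   cor.test(quality, fixed.acidity))

# Degrees of freedom are n - 2 in each case.
all_wines$parameter
good_wines$parameter
```

A correlation that is significant on the full data can vanish inside one quality band, because restricting quality to 7 and 8 leaves very little variation in that variable.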

As the charts and histograms above illustrate, some outliers stand out. The distributions of fixed acidity and volatile acidity both have long positive tails, which pulls their means above their medians and makes the median a better measure of central value. The citric acid distribution looks slightly bimodal and has a few outliers as well. One interesting thing I noticed is the unusual spikes around 0.0 g/dm^3 and 0.5 g/dm^3, which may indicate that a few concentrations are more common than others. Residual sugar is highly positively skewed. In addition, its plot contains two peaks, a pattern that is visible in many of the plots and could be mainly due to wine type. The density and sulphates distributions, like the others, have long tails. The alcohol distribution is slightly skewed, and its mean and median are almost the same. The minimum alcohol content among the wines in our dataset is 8.4%.
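
The effect of a long positive tail on the mean can be checked with a few numbers. The vector below is a made-up example in the range of residual.sugar (it is not a sample from RedWineData); a single large value moves the mean a lot and the median hardly at all.

```r
# Invented values: five typical readings plus one extreme one.
toy_sugar <- c(1.9, 2.0, 2.2, 2.3, 2.6, 15.5)

# The mean is pulled towards the extreme value, the median is not.
c(mean = mean(toy_sugar), median = median(toy_sugar))

# Dropping the extreme value brings the two much closer together.
c(mean = mean(toy_sugar[-6]), median = median(toy_sugar[-6]))
```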

We could remove outliers where appropriate, which would make the following analysis more robust (see http://www.public.iastate.edu/~maitra/stat501/lectures/Outliers.pdf). However, at this stage I decided to keep the outliers so that the reader gets a better understanding of the current dataset.
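
If outliers were to be removed, one common choice is the boxplot rule: keep values within 1.5 times the interquartile range of the quartiles. The sketch below assumes that rule; the helper name within_iqr_fence and the toy values are hypothetical, and this is not a step that was applied in this report.

```r
# Hypothetical helper: TRUE for values inside the 1.5 * IQR fences.
# The 1.5 * IQR criterion is an assumed choice, not one taken from the report.
within_iqr_fence <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}

# Invented values in the range of residual.sugar.
toy <- data.frame(residual.sugar = c(1.9, 2.0, 2.2, 2.3, 2.6, 15.5))

# Keep only the rows whose residual.sugar lies inside the fences.
trimmed <- toy[within_iqr_fence(toy$residual.sugar), , drop = FALSE]
nrow(trimmed)
```

Filtering one column at a time like this drops whole rows, so applying it to several variables in turn can remove far more wines than expected.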

Univariate Analysis

I myself am a red wine lover. The first variable in this dataset that caught my interest was the quality of the wine. Quality is scored on a scale of 0 to 10; in this dataset, however, the scores range only from 3 to 8. In the histogram below, I categorized the wines by quality as good (>= 7), average (5 <= quality < 7) and poor (< 5).

##    poor average    good 
##      63    1319     217

Distributions and Outliers

Looking at the first round of histograms that I created for the variables, residual sugar and chlorides appear, qualitatively, to have extreme outliers. Citric acid has a large number of zero values; I wonder whether this is a case of values not being reported. Density and pH look normally distributed, with only a few outliers. The sulfur dioxides, fixed and volatile acidity, sulphates and alcohol seem to be long-tailed.

Log10 scale: in base R's plot(), the log argument is a character string which contains “x” if the x axis is to be logarithmic. Here I used a log10 x axis through ggplot2's scale_x_log10().

ggplot(data = RedWineData,
       aes(x = citric.acid)) +
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = RedWineData,
       aes(x = fixed.acidity)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = RedWineData,
       aes(x = volatile.acidity)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

On the log10 scale, fixed.acidity and volatile.acidity appear to be approximately normally distributed.

I created a new variable, qualityrating, to rate the quality of each wine as good, average or poor.
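
One way to build such a rating is cut(), using the thresholds given earlier (poor below 5, average from 5 up to but not including 7, good from 7 up). This is a sketch on a toy data frame with invented scores; the original code for qualityrating is not shown in this report, so the exact call is an assumption.

```r
# Toy quality scores covering the range seen in the dataset (3 to 8).
toy <- data.frame(quality = c(3L, 4L, 5L, 6L, 7L, 8L))

# right = FALSE makes the intervals [-Inf, 5), [5, 7), [7, Inf),
# so a score of 5 is "average" and a score of 7 is "good".
toy$qualityrating <- cut(
  toy$quality,
  breaks = c(-Inf, 5, 7, Inf),
  labels = c("poor", "average", "good"),
  right = FALSE,
  ordered_result = TRUE
)

table(toy$qualityrating)
```

Making the factor ordered keeps poor, average and good in a sensible order in tables and plot legends.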

I addressed the distributions in the ‘Distributions and Outliers’ section. Here I continue by visualizing the outliers with boxplots. It is important to mention that I did not tidy, adjust or otherwise transform the data in this Univariate Analysis section; I do so in the Bivariate section. To draw the boxplots, I first define a new function that creates a boxplot for each variable.
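
The boxplot function itself is not reproduced in this report. A minimal sketch of what such a helper could look like with ggplot2 is given below; the name make_boxplot, its arguments and the toy data are all hypothetical.

```r
library(ggplot2)

# Hypothetical helper (name and signature are assumptions, not the report's code):
# draw a single boxplot for one numeric column, selected by its name.
make_boxplot <- function(data, column) {
  ggplot(data, aes(x = "", y = .data[[column]])) +
    geom_boxplot() +
    labs(x = NULL, y = column)
}

# Invented values in the range of the alcohol variable.
toy <- data.frame(alcohol = c(9.4, 9.8, 10.0, 10.5, 11.1, 14.9))

p <- make_boxplot(toy, "alcohol")
# print(p) would draw the plot on the active graphics device.
```

Passing the column name as a string makes it easy to loop over names(RedWineData) and produce one boxplot per variable.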

Bivariate Plots Section

Bivariate Analysis

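The plots above show that a good wine tends to have a lower pH, higher alcohol and higher fixed and citric acidity. Volatile acidity, by contrast, tends to be lower in good wines, consistent with its negative correlation with quality reported below.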

Next, some statistics on the quality of wines. The question to be answered is: which elements have the most influence on the quality of wine?

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
## log10.residual.sugar      log10.chlordies  free.sulfur.dioxide 
##           0.02353331          -0.17613996          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##      log10.sulphates              alcohol 
##           0.30864193           0.47616632


As it appears, alcohol, sulphates, volatile acidity and citric acid have the strongest correlations with quality; the correlation for volatile acidity is negative.
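
A table of correlations like the one above can be produced by applying cor() to each predictor in turn. The sketch below uses a toy data frame with invented values and only three predictors. The log10 transform of sulphates mirrors the log10.sulphates name in the output, but the exact code used for this report is an assumption.

```r
# Invented values, for illustration only (not rows from RedWineData).
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.1, 12.0),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.50, 0.40, 0.30),
  sulphates        = c(0.56, 0.68, 0.46, 0.80, 0.75, 0.90),
  quality          = c(5, 5, 5, 6, 7, 7)
)

# Predictors, with the skewed variable log10-transformed first.
predictors <- data.frame(
  alcohol          = toy$alcohol,
  volatile.acidity = toy$volatile.acidity,
  log10.sulphates  = log10(toy$sulphates)
)

# Pearson correlation of each predictor with quality.
correlations <- sapply(predictors, function(v) cor(v, toy$quality))

# Rank by absolute value so that strong negative correlations are not overlooked.
sort(abs(correlations), decreasing = TRUE)
```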

Multivariate Plots Section

Multivariate Analysis

Because the plots could get very dense, crowded and hard to read, I used facet_wrap(~qualityrating) to reduce the crowding. I tried to identify and illustrate the 4 features most correlated with the quality of wines. As shown, higher citric acid and lower volatile acidity tend to go with a better-quality wine. Higher sulphates and higher alcohol (%) are also associated with higher-quality wine.
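
A minimal sketch of the faceting approach is shown below: one scatter plot, colored by factor(quality) and split into one panel per rating with facet_wrap(~qualityrating). The toy data frame is invented, and the choice of alcohol and volatile.acidity for the axes is only an example.

```r
library(ggplot2)

# Invented values, for illustration only (not rows from RedWineData).
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.1, 12.0, 9.0, 12.5),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.50, 0.40, 0.30, 1.00, 0.35),
  quality          = c(5, 5, 6, 6, 7, 7, 4, 8)
)

# Same thresholds as in the text: poor < 5, average 5 to 6, good >= 7.
toy$qualityrating <- cut(toy$quality,
                         breaks = c(-Inf, 5, 7, Inf),
                         labels = c("poor", "average", "good"),
                         right = FALSE)

# One panel per rating keeps the points from piling on top of each other.
p <- ggplot(toy, aes(x = alcohol, y = volatile.acidity,
                     color = factor(quality))) +
  geom_point() +
  facet_wrap(~qualityrating)
# print(p) would draw the three panels.
```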


Final Plots and Summary

Plot One

Description One

I rated the wines as good, average and poor, and named the new variable qualityrating. In this scatter plot I left out the average wines and considered only good and poor quality wines. I initially wanted to see how alcohol and volatile acidity relate to the quality of wine. As is visible, higher volatile acidity goes with lower quality, and higher quality wines tend to have higher alcohol. As a result, I can see that a higher percentage of alcohol combined with lower volatile acidity is associated with a better wine in terms of quality.

Here I will try to visualise the correlation of each of the variables in the first plot with quality, in order to support my argument.

Plot Two

Description Two

These four boxplots show the effect of alcohol on the quality of wine. I personally don’t like the argument that ‘higher alcohol results in higher quality of wine’. Instead, I argue that the combination of higher alcohol with other factors produces higher quality, and in this case the other factor is the acid profile. The outliers here also show that alcohol alone is not a strong indicator of high quality wine.

Plot Three

## Warning: position_dodge requires non-overlapping x intervals

Description Three

As I identified in the first plot, higher acidity, or lower pH, is seen in higher quality wines. The four color-coded plots illustrate the relationship between the quality of wine, alcohol, and each of pH, citric acid, fixed acidity and volatile acidity.

Reflection

This EDA, or exploratory data analysis, helped me gain insights into the red wine quality dataset. I was able to visualize relationships and correlations between different variables, and to identify the variables most related to the quality of red wine. As it appears, alcohol, sulphates, volatile acidity and citric acid have the strongest correlations with the quality of red wines. I am interested in exploring the quality patterns of white wines to see if they share some similarities.

Extra information on Variable description

Description of attributes:

Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

Volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

Citric acid: found in small quantities, citric acid can add “freshness” and flavor to wines

Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

Chlorides: the amount of salt in the wine

Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of the wine

Total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

Alcohol: the percent alcohol content of the wine

Quality: output variable (based on sensory data, score between 0 and 10)

I would also like to expand a little on the struggles and successes of this analysis. This project, done in RStudio, was indeed the easiest of this Nanodegree. I had a very smooth time completing it, especially compared to the OpenStreetMap project. Having said that, there were a few concepts that I needed to read about extensively on the internet, or get help from mentors, to understand. In particular, using a factor for color (color = factor(quality)) in the multivariate analysis helped me a lot to better describe my plots and explain my arguments.

Resources:

https://www.bbr.com/wine-knowledge/faq-quality

https://en.wikipedia.org/wiki/Red_wine

https://www.rstudio.com